Automatic Normalization of Word Variations in Code-Mixed Social Media Text
Abstract
Social media platforms such as Twitter and Facebook are becoming popular in multilingual societies. This trend induces portmanteau of South Asian languages with English. The blend of multiple languages as code-mixed data has recently become popular in research communities for various NLP tasks. Code-mixed data consist of anomalies such as grammatical errors and spelling variations. In this paper, we leverage the contextual property of words, where the different spelling variations of a word share similar context in a large noisy social media text. We capture the different variations of words belonging to the same context in an unsupervised manner using distributed representations of words. Our experiments reveal that preprocessing the code-mixed dataset based on our approach improves the performance of state-of-the-art part-of-speech tagging (POS-tagging) and sentiment analysis tasks.
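The idea that spelling variants share contexts can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the window size, the similarity threshold, and the helper names `context_vectors`, `cosine`, and `variants_of` are all assumptions made for illustration, and simple co-occurrence counts stand in for the learned distributed representations the abstract refers to.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus, invented for illustration: "bahut", "bhut" and "bohot" are
# romanized spelling variants of the same Hindi word.
corpus = [
    "movie bahut achhi thi",
    "movie bhut achhi thi",
    "movie bohot achhi thi",
    "yeh song bahut achha hai",
    "yeh song bhut achha hai",
    "yeh song bohot achha hai",
    "kal match dekhne chalo",
    "kal office jaldi chalo",
]

def context_vectors(sentences, window=1):
    """Count the words seen within `window` positions of each word."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def variants_of(word, vectors, threshold=0.8):
    """Words whose context vector is close to that of `word` (assumed threshold)."""
    return sorted(
        other for other in vectors
        if other != word and cosine(vectors[word], vectors[other]) >= threshold
    )

vectors = context_vectors(corpus)
print(variants_of("bahut", vectors))  # ['bhut', 'bohot']
```

Context similarity alone can also group distinct words that happen to occur in the same contexts, so this sketch should be read as an illustration of the underlying intuition rather than a complete normalizer.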
Similar resources
Automatic Normalization of Word Variations in Code-Mixed Social Media Text
Social media platforms such as Twitter and Facebook are becoming popular in multilingual societies. This trend induces portmanteau of South Asian languages with English. The blend of multiple languages as code-mixed data has recently become popular in research communities for various NLP tasks. Code-mixed data consist of anomalies such as grammatical errors and spelling variations. In this pape...
Sentiment Identification in Code-Mixed Social Media Text
Sentiment analysis is the Natural Language Processing (NLP) task dealing with the detection and classification of sentiments in texts. While some tasks deal with identifying presence of sentiment in text (Subjectivity analysis), other tasks aim at determining the polarity of the text categorizing them as positive, negative and neutral. Whenever there is presence of sentiment in text, it has a s...
Identifying Languages at the Word Level in Code-Mixed Indian Social Media Text
Language identification at the document level has been considered an almost solved problem in some application areas, but language detectors fail in the social media context due to phenomena such as utterance internal code-switching, lexical borrowings, and phonetic typing; all implying that language identification in social media has to be carried out at the word level. The paper reports a stu...
Part-of-speech Tagging of Code-Mixed Social Media Text
A common step in the processing of any text is the part-of-speech tagging of the input text. In this paper, we present an approach to tackle code-mixed text from three different languages Bengali, Hindi, and Tamil apart from English. Our system uses Conditional Random Field, a sequence learning method, which is useful to capture patterns of sequences containing code switching to tag each word w...
Shallow Parsing Pipeline - Hindi-English Code-Mixed Social Media Text
In this study, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed. We have annotated the data, developed a language identifier, a normalizer, a part-of-speech tagger and a shallow parser. To the best of our knowledge, we are the first to attempt shallow parsing on CSMT. The pipeline developed has been made available to the research community w...
Journal
Journal title: Lecture Notes in Computer Science
Year: 2023
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-031-23793-5_30